Introduction to Distributional Regression

Lecture 5

Gillian Heller

NHMRC Clinical Trials Centre, University of Sydney

gillian.heller@sydney.edu.au

Discrete distributions

Discrete distributions: unbounded

  • \mathcal{S}=\{0,1,2,\ldots\}

  • Response is generally a count

  • Starting point: Poisson distribution

\begin{align*} y&\sim\text{PO}(\mu)\\ \mathbb{E}(y)&=\text{Var}(y)=\mu \end{align*}

  • Problems

    • Overdispersion/heavy tail \quad \text{Var}(y)>\mathbb{E}(y)

    • Excess/lack of zeroes

Discrete distributions



ZA = zero adjusted, ZI = zero inflated

Discrete distributions

Zero inflation

\begin{equation*} \begin{split} \mathbb{P}(Y=y)= \begin{cases}\textcolor{red}{\pi}+(1-\pi) \mathbb{P}\left(Y_1=0\right) & \text { if } y=0 \\ (1-\pi) \mathbb{P}\left(Y_1=y\right) & \text { if } y=1,2,3, \ldots \end{cases} \end{split} \end{equation*}

  • \mathbb{P}(Y_1) is the probability function of the response variable without zero inflation

  • \mathbb{P}(Y) is that of the zero-inflated response distribution.

  • Assumption: a proportion of the population (\pi) generates zero with certainty, and the remaining (1-\pi) generates counts according to the assumed response distribution.


Examples

  • number of falls, number of blood transfusions

  • species abundance counts

  • number of wildfires

  • etc …

Discrete distributions

Zero adjustment (aka hurdle models)

\begin{equation*} \begin{split} \mathbb{P}(Y=y)= \begin{cases}\textcolor{red}{\pi} & \text { if } y=0 \\ (1-\pi) \mathbb{P}\left(Y_2=y\right) & \text { if } y=1,2,3, \ldots \end{cases} \end{split} \end{equation*}

  • Y_2 is the zero-truncated version of the response variable.
  • In ZA (hurdle) model, the zero probability can be inflated or deflated with respect to the original zero probability

  • In ZI model, only inflation is possible.

  • If you need a ZI or ZA version of a distribution that isn’t supplied by gamlss (e.g. ZIWARING), this can be user-defined.

Discrete distributions

Bounded counts

  • \mathcal{S}=\{0, 1, \ldots, n\}

  • These generally are binomial-type responses

  • See Discrete distributions table for more information

Discrete distributions table

Application

APTS study

  • The Australian Placental Transfusion Study (APTS, Tarnow-Mordi et al., 2017) is an international, multicentre, randomized controlled trial of 1566 preterm infants born at 30 weeks gestation or earlier.

  • The trial compared

    • immediate cord clamping (ICC): clamping the umbilical cord within 10 seconds of birth
    • delayed cord clamping (DCC): clamping the cord at 60 seconds or longer
  • Primary outcome: death or major morbidity at 36 weeks gestation.

  • Here we analyse the number of blood transfusions received by the infant by 36 weeks gestation

The data

ICC
N = 773

DCC
N = 780

p-value1

Number of transfusions

0.001

Mean (SD)

2.01 (2.89)

1.64 (2.58)

Minimum, Maximum

0, 30

0, 24

Received transfusion, n (%)

468 (61%)

406 (52%)

<0.001

1Wilcoxon rank sum test; Pearson's Chi-squared test


Choose response distribution

library(gamlss)


APTSdata |> 
  dplyr::select(Treatment, transfusions, gestationalage) |>
  na.omit() -> APTSsubset

m1 = gamlss(transfusions ~ Treatment + gestationalage,
            sigma.fo=~Treatment + gestationalage,
            nu.formula =~Treatment + gestationalage,
            tau.formula = ~Treatment + gestationalage,
            data=APTSsubset, n.cyc=100, trace = FALSE)

chooseDist(m1, type="counts") 
minimum GAIC(k= 2 ) family: ZISICHEL 
minimum GAIC(k= 3.84 ) family: ZISICHEL 
minimum GAIC(k= 7.35 ) family: SI 
                2     3.84     7.35
PO       5609.248 5614.768 5625.298
GEOM     5033.518 5039.038 5049.568
GEOMo    5033.518 5039.038 5049.568
LG             NA       NA       NA
YULE     5634.311 5639.831 5650.361
ZIPF           NA       NA       NA
WARING   5005.754 5016.794 5037.854
GPO      4931.551 4942.591 4963.651
DPO      5041.813 5052.853 5073.913
BNB      4918.446 4935.006 4966.596
NBF      4933.775 4950.335 4981.925
NBI      4938.126 4949.166 4970.226
NBII     4938.126 4949.166 4970.226
PIG      4930.380 4941.420 4962.480
ZIP      5234.362 5245.402 5266.462
ZIP2     5269.020 5280.060 5301.120
ZAP      5235.848 5246.888 5267.948
ZALG     5116.385 5127.425 5148.485
DEL      4940.846 4957.406 4988.996
ZAZIPF   5534.915 5545.955 5567.015
SI       4907.584 4924.144 4955.734
SICHEL   4919.541 4936.101 4967.691
ZANBI    4980.174 4996.734 5028.324
ZAPIG    9099.858 9116.418 9148.008
ZINBI    4960.485 4977.045 5008.635
ZIPIG    5553.932 5570.492 5602.082
ZINBF    4936.486 4958.566 5000.686
ZABNB          NA       NA       NA
ZASICHEL 4893.147 4915.227 4957.347
ZINBF    4936.486 4958.566 5000.686
ZIBNB    4912.423 4934.503 4976.623
ZISICHEL 4892.522 4914.602 4956.722

Choose response distribution

  • The zero-inflated Sichel distribution is chosen

  • The Sichel is a 3-parameter discrete distribution, capable of accommodating long tails:

Select covariates: stepwise

m0 <- gamlss(transfusions ~ gestationalage, data=APTSsubset, family = ZISICHEL, n.cyc=50)
GAMLSS-RS iteration 1: Global Deviance = 5002.486 
GAMLSS-RS iteration 2: Global Deviance = 4966.412 
GAMLSS-RS iteration 3: Global Deviance = 4960.555 
GAMLSS-RS iteration 4: Global Deviance = 4958.091 
GAMLSS-RS iteration 5: Global Deviance = 4957.213 
GAMLSS-RS iteration 6: Global Deviance = 4957.069 
GAMLSS-RS iteration 7: Global Deviance = 4957.059 
GAMLSS-RS iteration 8: Global Deviance = 4957.058 
GAMLSS-RS iteration 9: Global Deviance = 4957.058 
m.step <- stepGAICAll.A(m0, 
                scope=list(lower=~1, 
                           upper=~Treatment + gestationalage))
--------------------------------------------------- 
Distribution parameter:  mu 
Start:  AIC= 4967.06 
 transfusions ~ gestationalage 

            Df    AIC
+ Treatment  1 4953.9
<none>         4967.1

Step:  AIC= 4953.91 
 transfusions ~ gestationalage + Treatment 

--------------------------------------------------- 
Distribution parameter:  sigma 
Start:  AIC= 4953.91 
 ~1 

                 Df    AIC
+ gestationalage  1 4923.4
<none>              4953.9
+ Treatment       1 4955.9

Step:  AIC= 4923.35 
 ~gestationalage 

            Df    AIC
+ Treatment  1 4923.1
<none>         4923.4

Step:  AIC= 4923.13 
 ~gestationalage + Treatment 

--------------------------------------------------- 
Distribution parameter:  nu 
Start:  AIC= 4923.13 
 ~1 

                 Df    AIC
+ gestationalage  1 4901.8
<none>              4923.1
+ Treatment       1 4924.9

Step:  AIC= 4901.82 
 ~gestationalage 

            Df    AIC
<none>         4901.8
+ Treatment  1 4906.9
--------------------------------------------------- 
Distribution parameter:  tau 
Start:  AIC= 4901.82 
 ~1 

                 Df    AIC
+ gestationalage  1 4891.0
<none>              4901.8
+ Treatment       1 4903.3

Step:  AIC= 4890.95 
 ~gestationalage 

                 Df    AIC
+ Treatment       1 4890.8
<none>              4891.0
- gestationalage  1 4901.8

Step:  AIC= 4890.85 
 ~gestationalage + Treatment 

                 Df    AIC
<none>              4890.8
- Treatment       1 4891.0
- gestationalage  1 4903.3
--------------------------------------------------- 
Distribution parameter:  nu 
Start:  AIC= 4890.85 
 ~gestationalage 

                 Df    AIC
<none>              4890.8
- gestationalage  1 4900.2
--------------------------------------------------- 
Distribution parameter:  sigma 
Start:  AIC= 4890.85 
 ~gestationalage + Treatment 

                 Df    AIC
- Treatment       1 4890.7
<none>              4890.8
- gestationalage  1 4891.6

Step:  AIC= 4890.74 
 ~gestationalage 

                 Df    AIC
<none>              4890.7
- gestationalage  1 4892.8
--------------------------------------------------- 
Distribution parameter:  mu 
Start:  AIC= 4890.74 
 transfusions ~ gestationalage + Treatment 

                 Df    AIC
<none>              4890.7
- Treatment       1 4892.7
- gestationalage  1 5044.0
--------------------------------------------------- 

Stepwise selection

summary(m.step)
******************************************************************
Family:  c("ZISICHEL", "Zero inflated Sichel") 

Call:  gamlss(formula = transfusions ~ gestationalage + Treatment,  
    sigma.formula = ~gestationalage, nu.formula = ~gestationalage,  
    tau.formula = ~gestationalage + Treatment, family = ZISICHEL,  
    data = APTSsubset, n.cyc = 50, trace = FALSE) 

Fitting method: RS() 

------------------------------------------------------------------
Mu link function:  log
Mu Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    11.04865    0.61220  18.047   <2e-16 ***
gestationalage -0.37903    0.02355 -16.094   <2e-16 ***
TreatmentDCC   -0.12442    0.06496  -1.915   0.0556 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
Sigma link function:  log
Sigma Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)     125.108    109.199   1.146    0.252
gestationalage   -4.241      3.706  -1.144    0.253

------------------------------------------------------------------
Nu link function:  identity 
Nu Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -27.3718     6.4285  -4.258 2.19e-05 ***
gestationalage   0.8821     0.2319   3.803 0.000148 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
Tau link function:  logit 
Tau Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -23.5968     3.4995  -6.743 2.19e-11 ***
gestationalage   0.7861     0.1242   6.328 3.25e-10 ***
TreatmentDCC     0.7187     0.2909   2.471   0.0136 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
No. of observations in the fit:  1553 
Degrees of Freedom for the fit:  10
      Residual Deg. of Freedom:  1543 
                      at cycle:  13 
 
Global Deviance:     4870.737 
            AIC:     4890.737 
            SBC:     4944.217 
******************************************************************

Stepwise selection

wp(m.step)

Selected model

Variable Parameter
\mu \sigma \nu \tau=\pi
Treatment \checkmark \checkmark
Gestational age \checkmark \checkmark \checkmark \checkmark

Smooth terms

  • We didn’t include smooth terms for gestational age in stepwise selection

  • We can do this manually

LR test for pb(gestationalage) in tau

mZISICHEL2 = gamlss(transfusions ~ Treatment + gestationalage,
            sigma.fo=~gestationalage,
            nu.formula =~ gestationalage,
            tau.formula = ~Treatment + pb(gestationalage),
            family=ZISICHEL, 
            data=APTSsubset, n.cyc=100, trace = FALSE)
LR.test(m.step, mZISICHEL2)
 Likelihood Ratio Test for nested GAMLSS models. 
 (No check whether the models are nested is performed). 
 
       Null model: deviance= 4870.737 with  10 deg. of freedom 
 Altenative model: deviance= 4867.573 with  11.22205 deg. of freedom 
 
 LRT = 3.163794 with 1.222045 deg. of freedom and p-value= 0.1000901 
  • p = 0.10 \rightarrow no significant difference between models \rightarrow keep linear term

Smooth terms

LR test for pb(gestationalage) in mu

mZISICHEL3 = gamlss(transfusions ~ Treatment + pb(gestationalage),
            sigma.fo=~gestationalage,
            nu.formula =~ gestationalage,
            tau.formula = ~Treatment + gestationalage,
            family=ZISICHEL, 
            data=APTSsubset, n.cyc=100, trace = FALSE)
LR.test(m.step, mZISICHEL3)
 Likelihood Ratio Test for nested GAMLSS models. 
 (No check whether the models are nested is performed). 
 
       Null model: deviance= 4870.737 with  10 deg. of freedom 
 Altenative model: deviance= 4864.456 with  11.47356 deg. of freedom 
 
 LRT = 6.281552 with 1.473559 deg. of freedom and p-value= 0.02411701 
  • p = 0.024 \rightarrow significant difference between models \rightarrow use smooth term
AIC(m.step, mZISICHEL3)
                 df      AIC
mZISICHEL3 11.47356 4887.403
m.step     10.00000 4890.737
BIC(m.step, mZISICHEL3)
                 df      BIC
m.step     10.00000 4944.217
mZISICHEL3 11.47356 4948.763

Fitted term plots

library(gamlss.ggplots)

fitted_terms(m.step, what="mu")

fitted_terms(m.step, what="tau")

Predicted distribution

at gestational ages 26.5 and 29 weeks (approx 1st and 3rd quartiles)

Code
library(ggplot2)
library(distreg.vis)
library(gridExtra)

gest <- c(26.5, 29)

covariate_data <- structure(list(Treatment = structure(2:1,
                                                       levels = c("ICC", "DCC"), class = "factor"),
    gestationalage = c(gest[1], gest[1])), class = "data.frame",
    row.names = c("DCC", "ICC"))
pred_data <- preds(model = mZISICHEL2, newdata = covariate_data)
plot_dist(model = mZISICHEL2, pred_params = pred_data,
    type = "pdf", palette = "Paired") +
  lims(x=c(-0.5,10)) +
  labs(title=paste0("a. Gestational age = ", gest[1], " weeks"))+ 
  guides(fill=guide_legend(title='Treatment')) +
  theme(text = element_text(size=8)) + 
  theme(legend.position="bottom") -> g1

covariate_data <- structure(list(Treatment = structure(2:1,
    levels = c("ICC", "DCC"), class = "factor"),
    gestationalage = c(gest[2], gest[2])), class = "data.frame",
    row.names = c("DCC", "ICC"))
pred_data <- preds(model = mZISICHEL2, newdata = covariate_data)
plot_dist(model = mZISICHEL2, pred_params = pred_data,
    type = "pdf", palette = "Paired") +
  lims(x=c(-0.5,10)) +
  labs(title=paste0("b. Gestational age = ", gest[2], " weeks"))+ 
  guides(fill=guide_legend(title='Treatment')) +
  theme(text = element_text(size=8)) -> g2

#extract legend
#https://github.com/hadley/ggplot2/wiki/Share-a-legend-between-two-ggplot2-graphs
g_legend<-function(a.gplot){
  tmp <- ggplot_gtable(ggplot_build(a.gplot))
  leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")
  legend <- tmp$grobs[[leg]]
  return(legend)}

mylegend<-g_legend(g1)

g3 <- grid.arrange(arrangeGrob(g1 + theme(legend.position="none"),
                         g2 + theme(legend.position="none"),
                         nrow=1),
             mylegend, nrow=2,heights=c(10, 1))

distreg.vis

library(distreg.vis)

vis()

References

Rigby, R.A., Stasinopoulos, D.M., Heller, G.Z., & De Bastiani, F. (2019). Distributions for modeling location, scale, and shape: Using GAMLSS in R. Boca Raton: Chapman & Hall/CRC.
Stadlmann, S., & Kneib, T. (2022). Interactively visualizing distributional regression models with distreg. vis. Statistical Modelling, 22(6), 527–545.
Stasinopoulos, D.M., Rigby, R.A., Heller, G.Z., Voudouris, V., & De Bastiani, F. (2017). Flexible regression and smoothing: Using GAMLSS in R. Boca Raton: Chapman & Hall/CRC.
Stasinopoulos, M.D., Kneib, T., Klein, N., Mayr, A., & Heller, G.Z. (2024). Generalized additive models for location, scale and shape: A distributional regression approach, with applications (Vol. 56). Cambridge University Press.
Tarnow-Mordi, W., Morris, J., Kirby, A., Robledo, K., Askie, L., Brown, R., … Simes, J. (2017). Delayed versus immediate cord clamping in preterm infants. New England Journal of Medicine, 377(25), 2445–2455. DOI: 10.1056/NEJMoa1711281